Qualcomm AI Engine Direct - calibration thread auto-tuning#18184
abhinaykukkadapu merged 1 commit into pytorch:main
Conversation
CI status as of commit bef50da (merge base eb92cec): 1 new failure, 3 pending, 2 unrelated failures (1 flaky, 1 broken trunk). Rebasing onto the `viable/strict` branch avoids the unrelated failures.
Force-pushed 3b02283 to 02f6db3.
```python
original_threads = torch.get_num_threads()
torch.set_num_threads(calib_threads)
```
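For context, a hedged sketch of how that save/restore could be scoped so the original thread count is restored even if calibration raises (`run_with_calib_threads` is an illustrative name, not the PR's actual helper):

```python
import torch

def run_with_calib_threads(calib_fn, calib_threads):
    """Run calib_fn with a temporary intra-op thread count, then restore."""
    original_threads = torch.get_num_threads()
    torch.set_num_threads(calib_threads)
    try:
        return calib_fn()
    finally:
        # Restore so later stages (export, QNN compile) are unaffected.
        torch.set_num_threads(original_threads)
```

The `try`/`finally` matters here: without it, an exception during decode calibration would leave the process pinned to the tuned thread count.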
What does this actually do, and what does it mean? How does it differ between CPU and GPU? Can we still use the GPU to calibrate?
I checked a bit more, and this is what Claude said:
PyTorch uses a heuristic that depends on the environment:
- Locally / outside containers: It typically defaults to the number of logical CPU cores (os.cpu_count()), which counts hyperthreaded cores.
- In containers / limited environments (like Docker with CPU limits, Kubernetes, or certain cloud VMs): PyTorch tries to respect CPU affinity and cgroup limits, so the thread count may be lower.
- With OpenMP: If PyTorch is compiled with OpenMP (common on Linux), the thread count may be governed by OMP_NUM_THREADS, which, if unset, OpenMP often sets to the logical core count.
It seems this is specific to PyTorch builds with OpenMP.
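To see which of these heuristics applies on a given host, the relevant values can be inspected directly (all standard PyTorch/stdlib calls; the print labels are just illustrative):

```python
import os
import torch

# What PyTorch actually chose for intra-op parallelism:
print("torch intra-op threads:", torch.get_num_threads())

# Logical core count (hyperthreads included) -- the usual local default:
print("os.cpu_count():", os.cpu_count())

# Cores this process may actually run on (respects taskset/cgroup cpusets);
# sched_getaffinity is Linux-only, hence the guard:
if hasattr(os, "sched_getaffinity"):
    print("affinity cores:", len(os.sched_getaffinity(0)))

# OpenMP override, if the user set one:
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
```

Note that `os.cpu_count()` ignores cgroup limits, which is why the affinity check matters inside containers.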
Curious what the Qualcomm folks' setup is. @haowhsu-quic
Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar, named `mkl_blas_sgemv`, in the flamegraph from the linked GitHub issues). This is a matrix-vector multiply, specific to decode; since the conv2d kernel workloads are smaller there, PyTorch seems to default to high thread counts assuming larger workloads.
@haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?
How about GPU? Does it make a difference?
Thanks. It looks like the initial thread count also uses `mkl_get_max_threads`. Not quite sure how they're different...
Regarding GPU, I think the GPU logic is shared with CPU? We can also do `model.to("cuda")` and the rest if needed, and it goes through the same path. I ran this path a while ago; unsure if it still works. Just trying to reduce the burden of using the GPU to calibrate the model here.
> Not quite sure how they're different
We tune the workload on the host across various thread counts to find the optimal number of threads.
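As a rough illustration of that tuning loop (not the PR's actual code; `pick_best_threads` and the SGEMV-like workload below are hypothetical), the idea is to time a representative matrix-vector multiply at each candidate thread count and keep the fastest:

```python
import time
import torch

def pick_best_threads(candidates, work, repeats=3):
    """Time `work()` at each candidate thread count; return the fastest."""
    original = torch.get_num_threads()
    best, best_t = original, float("inf")
    try:
        for n in candidates:
            torch.set_num_threads(n)
            work()  # warm-up run, excluded from timing
            t0 = time.perf_counter()
            for _ in range(repeats):
                work()
            elapsed = time.perf_counter() - t0
            if elapsed < best_t:
                best, best_t = n, elapsed
    finally:
        torch.set_num_threads(original)
    return best

# SGEMV-like decode workload: matrix-vector multiply.
A = torch.randn(2048, 2048)
x = torch.randn(2048)
best = pick_best_threads([1, 2, 4], lambda: A @ x)
```

On small matrix-vector workloads like this, fewer threads often win because the OpenMP barrier cost dominates the actual compute.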
> Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar, named `mkl_blas_sgemv`, in the flamegraph from the linked GitHub issues). This is a matrix-vector multiply, specific to decode; since the conv2d kernel workloads are smaller there, PyTorch seems to default to high thread counts assuming larger workloads. @haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?
Will test over the weekend (other ongoing tasks are occupying my machine); please wait a few days.
Hi @abhinaykukkadapu, I've tested with 2 machines:
Intel(R) Core(TM) i7-14700 with 28 cores:
| Model | Best Cores | Calibration Time (Hybrid Mode) |
|---|---|---|
| smollm2_135m | 8 | 313s |
| qwen2_5-0_5b | 8 | 289s |
| qwen2_5-1_5b | 20 | 965s |
| qwen3-0_6b | 8 | 1058s |
| gemma3-1b | 20 | 668s |
| smollm3-3b | 20 | 1659s |
| llama3_2-1b_instruct | 20 | 3043s |
| llama3_2-3b_instruct | 20 | 1707s |
(VM) AMD EPYC 7H12 with 16 cores:
| Model | Best Cores | Calibration Time (Hybrid Mode) |
|---|---|---|
| smollm2_135m | 11 | FP Exception |
| qwen2_5-0_5b | 16 | FP Exception |
| qwen2_5-1_5b | 12 | FP Exception |
| qwen3-0_6b | 16 | FP Exception |
| gemma3-1b | 12 | FP Exception |
| smollm3-3b | 8 | FP Exception |
| llama3_2-1b_instruct | 16 | FP Exception |
| llama3_2-3b_instruct | 12 | FP Exception |
The AMD processor hits an issue in recent PyTorch versions, which is addressed in #18098.
@haowhsu-quic thanks for testing it. Can you please stamp if you don't have any concerns? Thanks.
haowhsu-quic left a comment:
Impressive, thank you.
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (os.cpu_count()) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that samples fractions of the available thread ceiling (1/8 through 1.0) via a quick microbenchmark before prepare_pt2e — no observers exist yet, so synthetic benchmark inputs cannot pollute calibration state. Uses sched_getaffinity when available to respect cgroup/taskset constraints. Thread count is scoped to calibration only and restored after decode calibration phase. CLI override via --calibration_num_threads (0 = auto-tune, default). On a 72-vCPU host, auto-tune selects 18-36 threads depending on the workload, yielding 10.1x faster calibration (21.8 min vs 3h40m) with no PPL regression.
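The candidate-generation step described above (fractions 1/8 through 1.0 of an affinity-aware thread ceiling) could look roughly like this; `candidate_thread_counts` is an illustrative sketch under those stated assumptions, not the PR's exact implementation:

```python
import os

def candidate_thread_counts():
    """Sample fractions (1/8 .. 1.0) of the available thread ceiling.

    Uses sched_getaffinity where available so cgroup/taskset limits
    are respected; falls back to os.cpu_count() elsewhere.
    """
    if hasattr(os, "sched_getaffinity"):
        ceiling = len(os.sched_getaffinity(0))
    else:
        ceiling = os.cpu_count() or 1
    fractions = (1 / 8, 1 / 4, 1 / 2, 3 / 4, 1.0)
    seen, out = set(), []
    for f in fractions:
        n = max(1, int(ceiling * f))  # never drop below one thread
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```

Running the microbenchmark before `prepare_pt2e` is the key ordering constraint: no observers exist yet, so the synthetic benchmark inputs cannot pollute calibration statistics.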
Force-pushed 02f6db3 to bef50da.
TL;DR
Overall calibration time has been cut to roughly 10-25 minutes, down from the previous ~2.5 h, for various models (a ~10x speedup for the decode phase). These optimizations are the stacked results of multiple commits. The only remaining bottleneck is the QNN SDK compile step, which is opaque to us.
Thread tuning
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (`os.cpu_count()`) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that sweeps candidate thread counts via a quick microbenchmark and picks the fastest. CLI override via `--calibration_num_threads`. On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.
Calibration times for a few models
Llama3.2-1B PPL Validation
cc @cccclai @cbilgin @digantdesai @tanvirislam-meta